Because the housing market relies heavily on price predictions for both sellers and buyers, an accurate prediction model is essential. Zillow is an American real estate company with a large database of homes for sale and rent. However, because its data span the entire United States, it is difficult for Zillow to make precise housing price predictions for a specific city. The purpose of this project is to help Zillow build a housing price prediction model for Philadelphia based on local characteristics. The model would give sellers and buyers a strong guide for decision-making so that they do not substantially overpay. It would also reduce credibility risks and improve market efficiency.
The overall modeling strategy for the project is Ordinary Least Squares (OLS) regression with multiple variables, a widely used statistical method for examining the relationship between a dependent variable and a set of predictors. The method quantifies the strength and direction of each relationship and the goodness of fit of the model. Several challenges remain in our case. First, we need to select the most significant and diverse variables for the model and obtain the data from reliable sources. Second, we need to provide strong evidence and clear visualizations (plots and maps) with comprehensive interpretations for the audience. Third, we should state the limitations of the OLS regression results: unpredictable real-world factors, such as the COVID-19 pandemic, can negatively affect housing prices.
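The OLS setup can be illustrated with R's built-in `lm()`. This is a toy sketch with simulated numbers, not the project data or the project model:

```r
# Minimal OLS sketch: coefficients give the direction and strength of each
# relationship, and R-squared the goodness of fit. Toy data only.
set.seed(1)
toy <- data.frame(area = runif(100, 500, 3000))
toy$price <- 50000 + 120 * toy$area + rnorm(100, sd = 20000)

fit <- lm(price ~ area, data = toy)
coef(fit)["area"]        # close to the true slope of 120
summary(fit)$r.squared   # goodness of fit
```

With several predictors, the formula simply extends to `lm(price ~ area + age + income, ...)`, which is the form used later in this report.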
The results of our model indicate that using local data can improve the accuracy of housing price predictions for Philadelphia.
To decide which variables might affect housing prices, we started from three categories: interior conditions, neighborhood environment, and demographic characteristics. According to Zillow's website, living area, house age, and utilities are listed in the overview section of each listing, and they are among the top factors buyers consider. Therefore, we chose the columns "total_area", "year_built", and "quality_grade" in the original dataset to represent these factors. For environmental factors, people typically weigh safety and convenient access to public resources, so we explored the OpenDataPhilly website and used data on shooting victims, commercial corridors, and schools. For demographic characteristics, we retrieved 2021 census data for Philadelphia from the U.S. Census Bureau, which contains median household income, the percentage of foreign-born residents, and geographic mobility (moved to a different house).
# Load Libraries
library(geos)
library(rsgeo)
library(tidyverse)
library(tidycensus)
library(sf)
library(kableExtra)
library(knitr)
library(dplyr)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(stargazer)
options(scipen=999)
options(tigris_class = "sf")
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
Since our project focuses heavily on geographic features, we retrieved the block group and neighborhood boundaries from OpenDataPhilly, and we filtered the variables we need from the raw housing price data.
To incorporate environmental impacts into the housing price model, we used two approaches. For the shooting data, we counted the victims shot in the head within 1,000 feet of each house, because head wounds are the most likely to be fatal. Because the location and timing of crime are effectively random, a fixed buffer is an appropriate way to measure a community's safety level: even a fatal crime at a relatively far distance can cause panic and depress housing prices. We therefore summed crimes within a buffer rather than using the average nearest-neighbor distance.
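The buffer count described above can be sketched with sf. This is a minimal, self-contained illustration on toy planar coordinates (units treated as feet); in the real analysis the layers would be the house points and the head-shooting points in a feet-based projected CRS:

```r
library(sf)

# Toy example: three houses and three shooting incidents on a planar grid
houses    <- st_as_sf(data.frame(x = c(0, 5000, 10000), y = 0),
                      coords = c("x", "y"))
shootings <- st_as_sf(data.frame(x = c(400, 800, 5200), y = 0),
                      coords = c("x", "y"))

# Count incidents falling within a 1,000-foot buffer of each house
buffers <- st_buffer(houses, dist = 1000)
houses$shooting.Buffer <- lengths(st_intersects(buffers, shootings))
houses$shooting.Buffer  # 2, 1, 0
```

`lengths(st_intersects(...))` gives, for each buffer, the number of shooting points it contains, which becomes the `shooting.Buffer` predictor.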
The second approach applies to schools, commercial corridors, and public parks and recreation centers. Here we calculated the Euclidean distance, because the distance to these public resources is what matters: it represents the convenience of living in a house. The variables used in the further analysis are the distances from each house to its nearest public amenity.
nearest_school <- sf::st_nearest_feature(Philly_price_clean, schools.sf)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(schools.sf)
Philly_price_clean$dist_to_school <- rsgeo::distance_euclidean_pairwise(x, y[nearest_school])
nearest_Commercial <- sf::st_nearest_feature(Philly_price_clean, Commercial_Corridors)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(Commercial_Corridors)
Philly_price_clean$dist_to_commerce <- rsgeo::distance_euclidean_pairwise(x, y[nearest_Commercial])
nearest_PPR <- sf::st_nearest_feature(Philly_price_clean, PPR_Sites)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(PPR_Sites)
Philly_price_clean$dist_to_PPR <- rsgeo::distance_euclidean_pairwise(x, y[nearest_PPR])
For the census variables, we spatially joined each block group's data to the houses within that block group. We also assigned each house to a neighborhood for later use, and recoded the quality condition into four levels: good condition = 4, fair condition = 3, bad condition = 2, other = 1.
Philly_Housing_joined <-
st_join(Philly_price_clean, blockgroup21, join = st_within)
Philly_Housing_joined <-
st_join(Philly_Housing_joined, neighborhood, join = st_within)
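The four-level quality recode can be sketched with `dplyr::case_when()`. The category labels below are hypothetical, since the actual `quality_grade` codes are not shown here:

```r
library(dplyr)

# Hypothetical quality_grade labels; the real dataset's codes may differ
grades <- data.frame(quality_grade = c("good", "fair", "bad", "unknown"))

grades <- grades %>%
  mutate(quality_score = case_when(
    quality_grade == "good" ~ 4,
    quality_grade == "fair" ~ 3,
    quality_grade == "bad"  ~ 2,
    TRUE                    ~ 1   # everything else collapses to 1
  ))
grades$quality_score  # 4 3 2 1
```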
Table 2.1 gives a comprehensive summary of the independent variables that might affect housing prices. The total sample size is 23,781. Here is a description of the variables:
House condition:
- total_area: the total area owned, including the living area and the outdoor area
- houseAge: the age of the house since it was built
- interior_condition: the quality grade of the interior
Demographic characteristics:
- MedHHInc: the median household income in the past 12 months (in 2021 inflation-adjusted dollars)
- pctForeign: the percentage of residents born outside of the United States
- pctMobility: the percentage of residents who moved to a different house within the United States in the past year
Environmental impact:
- shooting.Buffer: the number of victims of gunshot wounds to the head within a 1,000-foot buffer of each house
- dist_to_school: the distance from a house to the nearest school
- dist_to_commerce: the distance from a house to the nearest commercial corridor
- dist_to_PPR: the distance from a house to the nearest public park or recreation center
The table reports the maximum, minimum, mean, and standard deviation of each variable; the standard deviations indicate large variability in the data. We should check for outliers in the subsequent analysis.
# Impute missing numeric values with the column mean
numeric_columns <- sapply(Philly_Housing_joined, is.numeric)
for (col in names(Philly_Housing_joined)[numeric_columns]) {
col_mean <- mean(Philly_Housing_joined[[col]], na.rm = TRUE)
Philly_Housing_joined[[col]][is.na(Philly_Housing_joined[[col]])] <- col_mean
}
house_condition <- st_drop_geometry(Philly_Housing_joined) %>%
select(total_area, houseAge, interior_condition) %>%
gather(variable, value) %>%
group_by(variable) %>%
summarize(
mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
Min = min(value, na.rm = TRUE),
Max = max(value, na.rm = TRUE)
)
demo <- st_drop_geometry(Philly_Housing_joined) %>%
select(MedHHInc, pctForeign, pctMobility) %>%
gather(variable, value) %>%
group_by(variable) %>%
summarize(
mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
Min = min(value, na.rm = TRUE),
Max = max(value, na.rm = TRUE)
)
envi_impact <- st_drop_geometry(Philly_Housing_joined) %>%
select(shooting.Buffer, dist_to_school, dist_to_commerce, dist_to_PPR) %>%
gather(variable, value) %>%
group_by(variable) %>%
summarize(
mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
Min = min(value, na.rm = TRUE),
Max = max(value, na.rm = TRUE)
)
summary_stats <- rbind(house_condition, demo, envi_impact)
summary_table <- knitr::kable(summary_stats, "html", digits = 1, caption = "Table of summary statistics") %>%
kable_styling(full_width = F)
grouped_table <- group_rows(summary_table, "House Condition", 1, 3)
grouped_table <- group_rows(grouped_table, "Demographic Characteristics", 4, 6)
grouped_table <- group_rows(grouped_table, "Environmental Impact", 7, 10)
grouped_table
| variable | mean | sd | Min | Max |
|---|---|---|---|---|
| House Condition | ||||
| houseAge | 85.4 | 29.7 | 0.0 | 273.0 |
| interior_condition | 3.6 | 0.9 | 0.0 | 7.0 |
| total_area | 1828.9 | 3813.0 | 0.0 | 226512.0 |
| Demographic Characteristics | ||||
| MedHHInc | 65581.2 | 31777.9 | 6302.0 | 250001.0 |
| pctForeign | 0.1 | 0.1 | 0.0 | 0.8 |
| pctMobility | 0.1 | 0.1 | 0.0 | 0.8 |
| Environmental Impact | ||||
| dist_to_PPR | 1796.5 | 1032.0 | 79.9 | 10317.7 |
| dist_to_commerce | 670.0 | 638.3 | 0.0 | 7385.7 |
| dist_to_school | 1147.7 | 647.6 | 32.4 | 5900.8 |
| shooting.Buffer | 3.4 | 2.5 | 1.0 | 32.0 |
Figure 2.1 is a correlation matrix of the numeric variables, showing the pairwise correlation between the variables on the x-axis and y-axis. The coefficient indicates a positive, negative, or zero correlation. A positive correlation means that as one variable increases, the other tends to increase, and vice versa; a coefficient of 0 means changes in one variable do not predict changes in the other. Most variable pairs show roughly zero or slightly negative correlation, but a few pairs, such as distance to school and distance to parks, are strongly positively correlated.
numericVars <- st_drop_geometry(Philly_Housing_joined)[, c("total_area", "houseAge", "interior_condition", "MedHHInc", "pctForeign", "pctMobility", "shooting.Buffer", "dist_to_school", "dist_to_commerce", "dist_to_PPR")]
ggcorrplot(
round(cor(numericVars), 1),
p.mat = cor_pmat(numericVars),
colors = c("#25CB10", "white", "#FA7800"),
type="lower",
insig = "blank") +
labs(title = "Correlation matrix across variables",
label = "Figure 2.1")
This scatterplot (Figure 2.2) shows the relationship between home price and median household income. The blue line is the best-fit line, with a positive but gentle slope: as median household income increases, home price increases, but only modestly. Points are concentrated in the bottom-left corner, meaning that lower-income households tend to live in lower-priced houses, while the data are more scattered for higher-income households, whose home prices vary widely along the y-axis. There are also three outliers priced near $600 million.
ggplot(Philly_Housing_joined, aes(x = MedHHInc, y = sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(
title = "Price as a function of median household income",
x = "Median Household Income",
y = "Home Price",
label = "Figure 2.2"
) +
theme_minimal()
### Home price correlation scatterplot with house age
Figure 2.3 is a scatterplot showing the correlation between house age and house price. The best-fit line shows that as house age increases, price gently decreases. There is also large variation in price at any given age, so we should consider whether house age is a good predictor of house value.
ggplot(Philly_Housing_joined, aes(x = houseAge, y = sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(
title = "Price as a function of house age",
x = "Age",
y = "Home Price",
label = "Figure 2.3"
) +
theme_minimal()
### Home price correlation scatterplot with total area
Figure 2.4 shows the correlation between the total area of a house and its price. The relationship is strongly positive: the slope of the best-fit line is steeper than in the previous two plots, and prices rise as total area grows. However, prices still vary widely when total area is relatively small; for example, some small houses are outliers with very high prices.
ggplot(Philly_Housing_joined, aes(x = total_area, y = sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se = FALSE, color = "orange") +
labs(
title = "Price as a function of total areas",
x = "total areas",
y = "Home Price",
label = "Figure 2.4"
) +
theme_minimal()
### Home price correlation with distance to Parks and Recreation
Figure 2.5 shows the correlation between the distance to the nearest Parks and Recreation site and house price. The best-fit line is nearly horizontal, meaning that house prices are largely unaffected by how far a house is from parks and recreation centers.
ggplot(Philly_Housing_joined, aes(x = dist_to_PPR, y = sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se = FALSE, color = "orange") +
labs(
title = "Price as a function of distance to Parks and Recreation",
x = "Distance to Parks and Recreation",
y = "Home Price",
label = "Figure 2.5"
) +
theme_minimal()
### Develop a map of the dependent variable (sale price)
Figure 2.6 is a choropleth map of sale prices by block group in Philadelphia. Values are grouped in equal intervals; lighter blue represents lower sale prices and darker blue represents higher ones. Most block groups have relatively low prices, below $1,000,000. Prices are higher in the northwest block groups and in the downtown area.
Philly_price_Model <- Philly_Housing_joined %>%
filter(toPredict == "MODELLING")
mean_sale_price <- Philly_price_Model %>%
group_by(GEOID) %>%
summarize(MeanSalePrice = mean(sale_price))
# Attribute join on GEOID (st_join has no `by` argument)
blockgroup21 <- left_join(blockgroup21, st_drop_geometry(mean_sale_price), by = "GEOID")
blockgroup21$MeanSalePrice[is.na(blockgroup21$MeanSalePrice)] <- 0
ggplot() +
geom_sf(data = blockgroup21, aes(fill = MeanSalePrice)) +
scale_fill_gradient(low = "lightblue", high = "darkblue",
name = "Sale Price (U.S. Dollars)") +
  labs(title = "Choropleth Map of Housing Sale Price by Block Groups",
       label = "Figure 2.6") +
theme_minimal()
### Develop a map of the most interesting independent variables: Geographic Mobility
The map of geographic mobility by block group shows a scattered pattern: no cluster of adjacent block groups shares the highest or lowest mobility rates. However, one block group, in dark blue, has a markedly high mobility rate.
ggplot() +
geom_sf(data = blockgroup21, aes(fill = pctMobility)) +
scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Geographic Mobility Rate") +
labs(title = "Choropleth Map of Geographic Mobility by Block Groups",
label = "Figure 2.7") +
theme_minimal()
### Develop a map of the most interesting independent variables: Distance to School
The map shows the average distance to the nearest school by block group. South Philadelphia clearly has shorter average distances, while most blocks in North Philadelphia have longer ones. Block groups in downtown Philadelphia have similar distances, from 2,000 to 3,000 feet.
mean_school_dis <- Philly_Housing_joined %>%
group_by(GEOID) %>%
summarize(MeanSchoolDis = mean(dist_to_school))
# Attribute join on GEOID (st_join has no `by` argument)
blockgroup21 <- left_join(blockgroup21, st_drop_geometry(mean_school_dis), by = "GEOID")
blockgroup21$MeanSchoolDis[is.na(blockgroup21$MeanSchoolDis)] <- 0
ggplot() +
geom_sf(data = blockgroup21, aes(fill = MeanSchoolDis)) +
scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Average Distance") +
labs(title = "Choropleth Map of Distance to school by Block Groups",
label = "Figure 2.8") +
theme_minimal()
The heat map shows the density of victims who suffered gunshot wounds to the head. The incidents are clearly concentrated in downtown and West Philadelphia.
ggplot() + geom_sf(data = blockgroup21, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(PhillyShootingHead.sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_gradient(low = "lightblue", high = "purple", name = "Density") +
scale_alpha(range = c(0.00, 0.35), guide = "none") +
labs(title = "Density map of Shootings in head distribution, Philadelphia",
label = "Figure 2.9") +
mapTheme()
## III. Methods
### 3.1 Coding techniques
We performed our OLS regression analysis in R. Packages including kableExtra, ggplot2, and stargazer helped us create data visualizations and statistical summaries. We combined our code and report in an R Markdown file, which can be knitted into an HTML file.
In Part II, we introduced our datasets and provided a general summary of our predictors' statistics. The correlation matrix, scatterplots, and geospatial visualizations offer a multidimensional way to understand the predictors. A comprehensive table of summary statistics provides insight into the distribution of every variable in the dataset. These variables are sorted into categories for easy reference: internal characteristics, amenities/public services, and spatial structure.
First, we separated the data containing actual sale prices from the data to be predicted. We split the modeling data into training and test sets in a 3:2 proportion. The training data is used to fit the model with minimum error, while the test data evaluates the model's out-of-sample performance in predicting house values. We fit a linear regression model using the lm function in R; the estimated values are interpreted in the Results section. Mean absolute error (MAE) is also calculated on the test set to measure the model's accuracy.
Second, we ran 100-fold cross-validation to assess the model's performance. This procedure splits the modeling data into 100 roughly equal folds at random and repeatedly trains on k-1 of the folds, validating on the held-out fold. To interpret the mean absolute error and the correlation between predicted and actual sale prices more visually, we created a histogram, a scatterplot, and a map of their distribution.
Third, we ran a Moran's I test for spatial autocorrelation. The plot shows the observed Moran's I against the frequency distribution of 999 randomly permuted values. A scatterplot of prediction errors against the spatial lag of errors is also displayed to show whether spatial autocorrelation exists in the errors. Finally, we fed the unpredicted data into the regression model and mapped the results.
The last part of the analysis tests the neighborhood model's generalizability across space. We first ran another regression by neighborhood and mapped the mean absolute percent error (MAPE) by neighborhood. We also used Census data on median household income and foreign-born residents to categorize the neighborhoods, and built tables comparing the mean absolute percent errors.
inTrain <- createDataPartition(
y = paste(Philly_price_Model$interior_condition),
p = .60, list = FALSE)
PhillyPrice.training <- Philly_price_Model[inTrain,]
PhillyPrice.test <- Philly_price_Model[-inTrain,]
reg.training <-
lm(sale_price ~ ., data = as.data.frame(PhillyPrice.training) %>%
dplyr::select(sale_price, total_area, interior_condition, houseAge, shooting.Buffer, dist_to_school, dist_to_commerce, dist_to_PPR, MedHHInc, pctForeign, pctMobility))
summary_lm <- summary(reg.training)
# Create a table of the model summary using stargazer
stargazer(reg.training,
title = "Linear Regression Summary",
type = "html",
align = TRUE
)
##
## <table style="text-align:center"><caption><strong>Linear Regression Summary</strong></caption>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td>sale_price</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">total_area</td><td>18.273<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.530)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">interior_condition</td><td>-69,775.410<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(1,920.490)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">houseAge</td><td>81.003</td></tr>
## <tr><td style="text-align:left"></td><td>(57.929)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">shooting.Buffer</td><td>-5,647.677<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(643.075)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">dist_to_school</td><td>3.757</td></tr>
## <tr><td style="text-align:left"></td><td>(2.677)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">dist_to_commerce</td><td>-29.525<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(2.647)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">dist_to_PPR</td><td>3.785<sup>**</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(1.668)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">MedHHInc</td><td>2.937<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.052)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">pctForeign</td><td>84,816.050<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(12,911.530)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">pctMobility</td><td>237,454.600<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(15,012.680)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>282,987.600<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(10,368.420)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>14,272</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.380</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.380</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>186,264.600 (df = 14261)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>874.741<sup>***</sup> (df = 10; 14261)</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>
PhillyPrice.test <-
PhillyPrice.test %>%
mutate(sale_price.Predict = predict(reg.training, PhillyPrice.test),
sale_price.Error = sale_price.Predict - sale_price,
sale_price.AbsError = abs(sale_price.Predict - sale_price),
         # Note: APE here divides by the predicted price; the neighborhood
         # comparison later divides by the observed price
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict)
error_summary <- st_drop_geometry(PhillyPrice.test) %>%
summarize(
MAE = mean(sale_price.AbsError),
MAPE = mean(sale_price.APE)
)
kable(error_summary, "html", caption = "Error Summary for Test Set", digits = 2) %>%
kable_styling()
| MAE | MAPE |
|---|---|
| 104014.2 | 0.39 |
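As a concrete check of how these two error metrics are computed (toy numbers, not the project data):

```r
# Toy MAE and MAPE computation. Note: the conventional MAPE divides by the
# actual price, as here; the test-set APE above divides by the predicted.
actual    <- c(100000, 200000, 300000)
predicted <- c(110000, 180000, 330000)

mae  <- mean(abs(predicted - actual))            # 20000
mape <- mean(abs(predicted - actual) / actual)   # mean(0.1, 0.1, 0.1) = 0.1
```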
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(sale_price ~ ., data = st_drop_geometry(Philly_price_Model) %>%
dplyr::select(sale_price,
total_area, interior_condition,
houseAge, shooting.Buffer, dist_to_school,
dist_to_commerce, dist_to_PPR, MedHHInc,
pctForeign, pctMobility),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv
## Linear Regression
##
## 23781 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 23542, 23542, 23544, 23544, 23544, 23542, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 180808.5 0.4266118 103295.6
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
mean(reg.cv$resample[,3])
## [1] 103295.6
sd(reg.cv$resample[,3])
## [1] 11210.03
# reg.cv$resample has one row per fold; use its Resample column, not names()
cv_results <- data.frame(Resample = reg.cv$resample$Resample,
                         MAE = reg.cv$resample$MAE)
# Create a histogram of MAE values
ggplot(cv_results, aes(x = MAE)) +
geom_histogram(binwidth = 1000, fill = "blue", color = "black") +
labs(title = "Cross-Validation Mean Absolute Error (MAE) Histogram",
x = "Mean Absolute Error (MAE)",
y = "Frequency")
scatter_plot <- ggplot(data = PhillyPrice.test, aes(x = sale_price.Predict, y = sale_price)) +
geom_point() + # Add scatter points
geom_abline(intercept = 0, slope = 1, color = "orange", linetype = "dashed") + # Perfect fit line
  geom_abline(intercept = mean(PhillyPrice.test$sale_price), slope = 1, color = "green", linetype = "dashed") + # Slope-1 line offset by the mean observed price
labs(title = "Scatter Plot of SalePrice vs. SalePrice.Predict",
x = "SalePrice.Predict",
y = "SalePrice")
# Print the scatter plot
print(scatter_plot)
### Provide a map of your residuals for your test set.
palette5 <- c("#25CB10", "#5AB60C", "#8FA108", "#C48C04", "#FA7800")
ggplot() +
geom_sf(data = blockgroup21, fill = "grey40") +
geom_sf(data = PhillyPrice.test, aes(colour = q5(sale_price.Error)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
name="Quintile\nBreaks") +
labs(title="Map of Residuals of test set") +
mapTheme()
### Moran’s I test
coords <- st_coordinates(Philly_price_Model)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
Philly_price_Model$lagPrice <- lag.listw(spatialWeights, Philly_price_Model$sale_price)
coords.test <- st_coordinates(PhillyPrice.test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
PhillyPrice.test <- PhillyPrice.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error))
moranTest <- moran.mc(PhillyPrice.test$sale_price.Error,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count") +
plotTheme()
### Scatter plot of error as a function of the spatial lag of price
ggplot(PhillyPrice.test, aes(x = lagPriceError, y = sale_price.Error)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "orange")+
labs(title = "Error as a function of the spatial lag of errors",
x = "lagPriceError",
y = "sale_price.Error")
### Provide a map of your predicted values for where ‘toPredict’ is both “MODELLING” and “CHALLENGE”.
Philly_Housing_joined <-
Philly_Housing_joined %>%
mutate(sale_price.Predict = predict(reg.training, Philly_Housing_joined))
ggplot() +
geom_sf(data = neighborhood, fill = "grey40") +
geom_sf(data = Philly_Housing_joined, aes(colour = q5(sale_price.Predict)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
name="Quintile\nBreaks") +
labs(title="Map of predicted values") +
mapTheme()
### Using the test set predictions, provide a map of mean absolute percentage error (MAPE) by neighborhood.
left_join(
st_drop_geometry(PhillyPrice.test) %>%
group_by(name) %>%
summarize(meanPrice = mean(sale_price, na.rm = T)),
mutate(PhillyPrice.test, predict.fe =
predict(lm(sale_price ~ name, data = PhillyPrice.test),
PhillyPrice.test)) %>%
st_drop_geometry %>%
group_by(name) %>%
summarize(meanPrediction = mean(predict.fe)))
## # A tibble: 145 × 3
## name meanPrice meanPrediction
## <chr> <dbl> <dbl>
## 1 ACADEMY_GARDENS 307321. 307321.
## 2 ALLEGHENY_WEST 80371. 80371.
## 3 ANDORRA 382722. 382722.
## 4 ASTON_WOODBRIDGE 287730 287730.
## 5 BARTRAM_VILLAGE 125588. 125588.
## 6 BELLA_VISTA 596240. 596240.
## 7 BELMONT 127709. 127709.
## 8 BREWERYTOWN 238274. 238274.
## 9 BRIDESBURG 207231. 207231.
## 10 BURHOLME 250200 250200.
## # ℹ 135 more rows
# Neighborhood-effects model: include the neighborhood name as a fixed effect
reg.nhood <- lm(sale_price ~ ., data = as.data.frame(PhillyPrice.training) %>%
                  dplyr::select(sale_price, name, total_area, interior_condition, houseAge, shooting.Buffer, dist_to_school, dist_to_commerce, dist_to_PPR, MedHHInc, pctForeign, pctMobility))
PhillyPrice.test.nhood <-
PhillyPrice.test %>%
mutate(Regression = "Neighborhood Effects",
sale_price.Predict = predict(reg.nhood, PhillyPrice.test),
sale_price.Error = sale_price.Predict- sale_price,
sale_price.AbsError = abs(sale_price.Predict- sale_price),
sale_price.APE = (abs(sale_price.Predict- sale_price)) / sale_price)
mape_neighborhood <- PhillyPrice.test.nhood %>%
group_by(name) %>%
summarize(MAPE = mean(sale_price.APE, na.rm = TRUE))
# Attribute join on name (st_join has no `by` argument)
neighborhood <- left_join(neighborhood, st_drop_geometry(mape_neighborhood), by = "name")
ggplot(data = neighborhood) +
geom_sf(aes(fill = MAPE)) +
scale_fill_gradient(low = "lightblue", high = "darkblue", name = "MAPE") +
labs(title = "Mean Absolute Percentage Error by Neighborhood") +
theme_minimal()
mean_price_summary <- PhillyPrice.test.nhood %>%
group_by(name) %>%
summarize(meanPrice = mean(sale_price, na.rm = TRUE))
neighborhood <- left_join(neighborhood, st_drop_geometry(mean_price_summary), by = "name")
ggplot(data = neighborhood, aes(x = meanPrice, y = MAPE)) +
  geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
labs(title = "MAPE by Neighborhood as a Function of Mean Price") +
theme_minimal()
### Split your study area into two groups (perhaps by race or income) and test your model’s generalizability
blockgroup21 <- blockgroup21 %>%
  # NOTE: this split uses pctForeign; the labels refer to the Census
  # African-American population variable (B02001_003) noted below
  mutate(AfricanContext = ifelse(pctForeign > .1, "Majority African-American", "Majority non African-American"))
grid.arrange(ncol = 2,
ggplot() + geom_sf(data = na.omit(blockgroup21), aes(fill = AfricanContext)) +
scale_fill_manual(values = c("#25CB10", "#FA7800"), name="African-American Context") +
labs(title = "African-American Context") +
mapTheme() + theme(legend.position="bottom"))
# B02001_003 African American
PhillyPrice.test <- PhillyPrice.test %>%
mutate(Regression = "Baseline Regression")
bothRegressions <-
rbind(
dplyr::select(PhillyPrice.test, starts_with("sale_price"), Regression, name),
dplyr::select(PhillyPrice.test.nhood, starts_with("sale_price"), Regression, name))
st_join(bothRegressions, blockgroup21) %>%
group_by(Regression, AfricanContext) %>%
summarize(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(AfricanContext, mean.MAPE) %>%
kable("html", caption = "Test set MAPE by neighborhood racial context") %>%
kable_styling(full_width = F)
| Regression | Majority African-American | Majority non African-American |
|---|---|---|
| Baseline Regression | 37% | 40% |
| Neighborhood Effects | 57% | 69% |